Cortical topographic motifs emerge in a self-organized map of object space

The human ventral visual stream has a highly systematic organization of object information, but the causal pressures driving this topographic organization are actively debated. Here, we use self-organizing principles to learn a topographic representation of the data manifold of a deep neural network's representational space. We find that a smooth mapping of this representational space shows many brain-like motifs, with a large-scale organization by animacy and real-world object size, supported by mid-level feature tuning, and with naturally emerging face- and scene-selective regions. While some theories of object-selective cortex posit that these differently tuned regions of the brain reflect a collection of distinctly specified functional modules, the present work provides computational support for an alternate hypothesis: that the tuning and topography of object-selective cortex reflect a smooth mapping of a unified representational space.


Fig. S1. SOM training over time.
Training (initialization and fine-tuning) stages of the SOM. Each row visualizes the simulated cortex in the context of the input data's PC space: the green points depict the locations of images in the input feature space (DNN features), and the black connected points depict the tuning of SOM map units in this PC space. We also visualize the animacy and size preferences, and the face- and scene-selectivity, on the simulated cortex at every stage of the training process. A clearer distinction between big and small images is seen in the preference maps and t-maps in this trained feature space. See the supplementary analysis section for details and discussion of these results.

Fig. S5. Face and Place Selectivity with D' and Selectivity Index comparisons.
Category selectivity for faces and scenes within the two stimulus sets, measured using the Selectivity Index (SI) and the d-prime measure.
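For reference, below is a minimal sketch of standard formulations of these two selectivity measures; the exact definitions used for Fig. S5 are not reproduced in this excerpt, so the formulas and function names here are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch (assumed formulations, not the authors' code):
# d' and a selectivity index for one map unit's responses to a preferred
# category (e.g., faces) vs. all other categories.
import numpy as np

def d_prime(pref, nonpref):
    """d' = difference of means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((pref.var(ddof=1) + nonpref.var(ddof=1)) / 2)
    return (pref.mean() - nonpref.mean()) / pooled_sd

def selectivity_index(pref, nonpref):
    """SI = (mean preferred - mean non-preferred) / (mean preferred + mean non-preferred)."""
    return (pref.mean() - nonpref.mean()) / (pref.mean() + nonpref.mean())
```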

Comparing the large-scale organization between models (deep net and SOM) trained on different visual diets (ImageNet vs. Ecoset). For each visual experience, we visualize: (i) the representational geometry of images (stimuli from (9)) in the deep net feature space: (left) RDMs based on correlational distance, and (right) bar plots showing the image-level pairwise correlational distance between DNN features of big and small animals, and between DNN features of big and small objects. (ii) (Left) A 4-way preference map on the simulated cortex among big objects, small objects, big animals, and small animals, and (right) bar plots showing the image-level pairwise Euclidean distance between simulated cortical activations of big and small animals, and between simulated cortical activations of big and small objects, using the same stimuli as in (i). (iii) Heatmaps showing the absolute difference of mean simulated cortical activations by size for animals (i.e., big vs. small animals) and objects (i.e., big vs. small objects). (iv) Animacy (animals vs. objects) and size (small vs. big) preferences on the simulated cortex. (v) Face-, scene-, and body-selectivity on the simulated cortex, measured using the d-prime measure. Stimuli from (59) were used to compute the selectivity maps.
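As a point of reference, here is a minimal sketch of the representational-geometry computations named in this caption (a correlational-distance RDM and mean image-level pairwise distances between two stimulus groups), assuming simple NumPy conventions; the function and variable names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (assumed conventions): correlational-distance RDM and mean
# pairwise distances between two groups of image representations.
import numpy as np

def correlation_rdm(features):
    """features: (n_images, n_dims) -> (n_images, n_images) RDM of 1 - Pearson r."""
    return 1.0 - np.corrcoef(features)

def mean_pairwise_distance(feats_a, feats_b, metric="euclidean"):
    """Mean image-level pairwise distance between two stimulus groups."""
    if metric == "euclidean":
        diffs = feats_a[:, None, :] - feats_b[None, :, :]
        return np.linalg.norm(diffs, axis=-1).mean()
    if metric == "correlation":
        n_a = feats_a.shape[0]
        rdm = 1.0 - np.corrcoef(np.vstack([feats_a, feats_b]))
        return rdm[:n_a, n_a:].mean()
    raise ValueError(f"unknown metric: {metric}")
```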

Supplementary Analysis - Pixel Space Representation
In this supplementary set of analyses, we examined whether the distinctions between animate vs. inanimate and big vs. small objects emerge in a pixel-space representation. That is, without the deep neural network untangling learned by a pretrained AlexNet, do we already see these distinctions in pixel space, and to what degree?
To test this, we fit an SOM to the pixel space of the validation images of the ImageNet database, yielding an input matrix of 50,000 images x 150,528 dimensions (reflecting the RGB encoding of each image: 224 pixels x 224 pixels x 3 channels). The pixel-SOM shape was automatically set to 14 x 28 units, the largest map with <= 400 units that preserves the ratio of the first two eigenvalues of a sample of 400 images. Next, we probed the pixel-SOM with the same animal and object images (Konkle & Caramazza, 2013 (9)). The results are shown in Supplementary Figure 4.
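The map-sizing rule described above can be sketched as follows. This is a minimal reimplementation assuming a standard SOM library (MiniSom) and a PCA-based estimate of the first two eigenvalues; the library choice, parameter values, and helper names are assumptions for illustration, not the authors' implementation. With a first-to-second eigenvalue ratio near 2, the heuristic yields the 14 x 28 grid reported above.

```python
# Minimal sketch (assumptions: MiniSom library, illustrative parameters).
import numpy as np
from sklearn.decomposition import PCA
from minisom import MiniSom  # pip install minisom

def som_grid_shape(sample, max_units=400):
    """Largest (rows, cols) grid with rows*cols <= max_units whose aspect ratio
    follows the ratio of the first two eigenvalues of the image sample."""
    eigvals = PCA(n_components=2).fit(sample).explained_variance_
    ratio = eigvals[0] / eigvals[1]            # e.g., a ratio near 2 -> 14 x 28
    rows = int(np.floor(np.sqrt(max_units / ratio)))
    cols = int(np.floor(rows * ratio))
    return rows, cols

# Small random stand-in for the (50,000 images x 150,528 pixel dims) matrix.
rng = np.random.default_rng(0)
pixels = rng.standard_normal((400, 512))
rows, cols = som_grid_shape(pixels)
som = MiniSom(rows, cols, pixels.shape[1], sigma=max(rows, cols) / 2, learning_rate=0.5)
som.pca_weights_init(pixels)
som.train(pixels, num_iteration=1000)
```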
We find that in the preference maps for the pixel-SOM, most units had a stronger activation on average to animal images (Supplementary Figure 4A). When we quantify the separability between animal and object images with an independent 2-sample t-statistic, however, we find only weak separability between these two classes of images, as evident in the plotted t-maps (e.g., the average absolute value of the t-statistic across the map is |t|=1.29, and the maximum is t=2.38). For comparison, when we compute the t-maps over the relu7 space, we see clearer separability between animal and object image responses in each unit (average |t|=6.54, max |t|=16.12; Supplementary Figure 4B). Thus, the animal and object images are dramatically more entangled in the pixel space than in the trained relu7 space.
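Below is a minimal sketch of this unit-wise separability measure: an independent two-sample t-test computed per SOM unit across the two image classes, summarized by the mean and maximum absolute t across the map. The variable names and the random stand-in data are assumptions for illustration, not the actual simulated-cortex responses.

```python
# Minimal sketch (assumed variable names and stand-in data).
import numpy as np
from scipy.stats import ttest_ind

def unit_tmap(acts_a, acts_b):
    """acts_a: (n_images_a, n_units), acts_b: (n_images_b, n_units) -> t per unit."""
    t, _ = ttest_ind(acts_a, acts_b, axis=0)
    return t

# Stand-in simulated-cortex responses (images x map units).
rng = np.random.default_rng(0)
animal_acts = rng.normal(1.0, 1.0, size=(120, 392))
object_acts = rng.normal(0.0, 1.0, size=(120, 392))
t_map = unit_tmap(animal_acts, object_acts)
print(np.abs(t_map).mean(), np.abs(t_map).max())  # map-level summaries like those reported in the text
```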
Next, considering the big vs. small distinction, we find that in the pixel-SOM, most units have a stronger activation on average to small objects (Supplementary Figure 4C). However, as before, there is only weak separability between big and small entities (average |t|=1.98, max |t|=2.37). In contrast, the relu7-space t-maps show clear separability between big and small object images (average |t|=3.25, max |t|=11.97; Supplementary Figure 4D). Thus, images of big and small entities are also dramatically more entangled in the pixel space than in the trained relu7 space.
In doing these analyses, we noticed that the localizer images were dramatically out-of-distribution relative to the ImageNet validation images in pixel space (e.g., the localizer images are on a white background, affecting many of the pixel dimensions). So, in an exploratory analysis, we additionally trained a pixel-SOM directly on the localizer image set (240 images x 150,528 dimensions, yielding a 20 x 20 localizer-pixel-SOM) and probed it with the same images. We again found animal and small-object preferences across the entire map, and again the t-maps showed minimal separability between these classes (animacy: average |t|=2.19, max |t|=2.91; size: average |t|=2.94, max |t|=3.22). In contrast, an SOM trained only on these 240 images as they are embedded in the relu7 space (240 x 4096) showed clear animacy and object-size organizations (animacy: average |t|=7.59, max |t|=17.2; size: average |t|=3.64, max |t|=8.04). Thus, even in the context of just the localizer images, the animacy distinction is entangled in the pixel space of this image set, but becomes untangled by the relu7 stage of the deep neural network.
A further question we wondered about was whether it is always the case that animals and small objects elicit a stronger preference in pixel space, or whether these preferences are a property of this particular set of 240 color images. We next probed the ImageNet-pixel-SOM with gray-scaled versions of these same images, as well as with texforms (stimulus set from Long et al. (11)). Here, units showed both animate and inanimate preferences, and both big and small preferences, on average, though again with all t-statistics very low. Thus, this additional exploratory analysis implies that it is not generally the case that animals and small objects are more extreme in pixel space.
Taken together, these analyses demonstrate that the distinctions between animals and objects, and between big and small entities, are deeply entangled in pixel space, and that the reformatting of image information through the hierarchical stages of a deep neural network is critical for these distinctions to emerge.